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Abstract 

A treatment regime formalizes personalized medicine as a function from individ¬ 
ual patient characteristics to a recommended treatment. A high-quality treatment 
regime can improve patient outcomes while reducing cost, resource consumption, and 
treatment burden. Thus, there is tremendous interest in estimating treatment regimes 
from observational and randomized studies. However, the development of treatment 
regimes for application in clinical practice requires the long-term, joint effort of statis¬ 
ticians and clinical scientists. In this collaborative process, the statistician must in¬ 
tegrate clinical science into the statistical models underlying a treatment regime and 
the clinician must scrutinize the estimated treatment regime for scientific validity. To 
facilitate meaningful information exchange, it is important that estimated treatment 
regimes be interpretable in a subject-matter context. We propose a simple, yet flexible 
class of treatment regimes whose members are representable as a short list of if-then 
statements. Regimes in this class are immediately interpretable and are therefore an 
appealing choice for broad application in practice. We derive a robust estimator of 
the optimal regime within this class and demonstrate its finite sample performance 
using simulation experiments. The proposed method is illustrated with data from two 
clinical trials. 
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1 Introduction 


Treatment regimes formalize clinical decision making as a function from patient information 
to a recommended treatment. Proponents of personalized medicine envisage the widespread 
clinical use of evidence-based, i.e., data-driven, treatment regimes. The potential benefits 
of applying treatment regimes are now widely recognized. By individualizing treatment, a 
treatment regime may improve patient outcomes while reducing cost and the consumption 
of resources. This is important in an era of growing medical costs and an aging global 
population. However, the widespread integration of data-driven treatment regimes into 
clinical practice is, and should be, an incremental process wherein: (i) data are used to 
generate hypotheses about optimal treatment regimes; (ii) the generated hypotheses are 
scrutinized by clinical collaborators for scientific validity; (iii) new data are collected for 
validation and new hypothesis generation, and so on. Within this process, it is crucial that 
estimated treatment regimes be interpretable to clinicians. Nevertheless, optimality, not 
interpretability, has been the focal point in the statistical literature on treatment regimes. 

A treatment regime said to be optimal if it maximizes the expectation of a pre-specihed 
clinical outcome when used to assign treatment to a population of interest. There is a 
large literature on estimating optimal treatment regimes using data from observational or 
randomized studies. Broadly, these estimators can be categorized as regression-based or 
classification-based estimators. Regression-based estimators model features of the condi¬ 
tional distribution of the outcome given treatment and patient covariates. Examples include 
estimators of the regression of an outcome on covariates, treatment, and their interactions 


e.g., Su et al. 2009; Qian and Murphy, 2011 Tian et ah, 2014), and estimators of point 


treatment effects given covariates (e.g., Robins, 1994 Vansteelandt et al. 2014). Regression- 
based methods rely on correct specification of some or all of the modeled portions of the 
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conditional distribution of the outcome. Thus, a goal of many regression-based estimators 


is to ensure correct model specification under a large class of generative models (Zhao et ah 


2009 Qian and Murphy 2011 Moodie et al. 2013 Laber et ah, 2014; Taylor et ah, 2014). 


However, as flexibility is introduced into the model, interpretability tends to diminish, and 
in the extreme case the estimated treatment regime becomes an unintelligible black box. 

Classification-based estimators, also known as policy-search or value-search estimators, 
estimate the marginal mean of the outcome for every treatment regime within a pre-specihed 
class and then take the maximizer as the estimated optimal regime. Examples include 


marginal structural mean models (Robins et al. 2008 Orellana et al. 2010); robust marginal 


mean models (Zhang et ah, 2012b); and outcome weighted learning (Zhang et ah, 2012a 


Zhao et ah, 2012, 2015). Classification-based estimators often rely on fewer assumptions 


about the conditional distribution of the outcome given treatment and patient information 
and thus may be more robust to model misspecihcation than regression-based estimators 
(Zhang et al. 2012b,a). Furthermore, because classification-based methods estimate an 
optimal regime within a pre-specihed class, it is straightforward to impose structure on the 
estimated regime, e.g., interpretability, by restricting this class. We use robust marginal 
mean models with a highly interpretable yet flexible class of regimes to estimate a high- 
quality regime that can be immediately understood by clinical and intervention scientists. 

To obtain an interpretable and parsimonious treatment regime, we use the concept of 
decision list, which was developed in the computer science literature for representing flexible 


but interpretable classifiers (Rivest 1987 Clark and Niblett, 1989 Marchand and Sokolova 


2005 Letham et al. 2012 Wang and Rudin, 2014); see Freitas (2014) for a recent position 


paper on the importance of interpretability in predictive modeling and additional references 
on interpretable classifiers. As a treatment regime, a decision list comprises a sequence 
of “if-then” clauses that map patient covariates to a recommended treatment. Figure [l] 
shows a decision list for patients with chronic depression (see Section 4.2). This decision 
list recommends treatments as follows: if a patient presents with somatic anxiety score 
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Figure 1: Estimated decision list for treating patients with chronic depression. 


above 1 and retardation score above 2, the list recommends nefazodone; otherwise, if the 
patient has Hamilton anxiety score above 23 and sleep disturbance score above 2, the list 
recommends psychotherapy; and otherwise the list recommends nefazodone + psychotherapy 
(combination). Thus, a treatment regime represented as a decision list can be conveyed as 
either a diagram or text and is easily understood, in either form, by domain experts. Indeed, 


decision lists have frequently been used to display estimated treatment regimes (Shortreed 


et al., 2011; Moodie et ah, 2012 Shortreed et ah, 2014; Laber and Zhao, 2015) or to describe 


theory-based, i.e., not data-driven, treatment regimes (Shiffman 1997 Marlowe et al. 2012). 


Another important attribute of a decision list is that it “short circuits” measurement of 
patient covariates; e.g., in Figure[lJ the Hamilton anxiety score and sleep disturbance score do 
not need to be collected for patients with somatic anxiety score above 1 and retardation score 
above 2. This is important in settings where patient covariates are expensive or burdensome 


to collect (e.g., Gail et al. 1999; Gail 2009; Baker et al., 2009 Huang et al., 2015). We 
provide an estimator of the treatment regime that minimizes an expected cost among all 
regimes that optimize the marginal mean outcome. 
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2 Methodology 


2.1 Framework 

We assume that the observed data are {(Xj, A*, Yj)}” =1 , which comprise n independent iden¬ 
tically distributed observations, one for each subject in an experimental or observational 
study. Let (. X , A, Y ) denote a generic observation. Then X e M p are baseline patient covari¬ 
ates; A6i={l,... ,m} is the treatment assigned; and Y 6 M. is the outcome, coded so 
that higher values are better. A treatment regime, 7r, is a function from M. p into A, so that 
under 7r a patient presenting with X = x is recommended treatment tt(x). 

The value of a regime n is the expected outcome if all patients in the population of 
interest are assigned treatment according to n. To define the value, we use the set of 
potential outcomes {y*(a)} og ^, where Y*(a) is the outcome that would be observed if a 
subject were assigned treatment a. Define Y*( tt) = X^ a g.A^*( a )^{ 7r (^0 = a } t° be the 
potential outcome under regime t r, and R(tt) = E{y*(7r)} to be the value of regime tt. An 
optimal regime, say 7r opt , satisfies i?(7r opt ) > R(7t) for all n. Let II denote a class of regimes of 
interest. Classification-based estimation methods form an estimator of .R(7r), say R(tt), and 
then estimate 7r opt using n = argmax^gn R(tt). The success of this approach requires: (i) a 
high-quality estimator of R(tt)] (ii) a sufficiently rich class LI; and (iii) an efficient algorithm 
for maximizing R(tt) over II. We discuss these topics in the next three sections. 


2.2 Estimation of R(tt) 


We make several standard assumptions: (Al) consistency: Y = Y*(A)] (A2) no unmeasured 
confounders: {b'*(a)} ae _ 4 are conditionally independent of A given A"; and (A3) positivity: 
there exists 6 > 0 so that P(A = a|A) > 5 for all a € A. Assumption (A2) is automatically 


satisfied in a randomized study but is untestable in observational studies (Robins et al. 
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2000). Under (Al)-(A3), it can be shown (Tsiatis, 2006) that 

I {A = a) 




{Y — n(X, a)} + n(X, a) 


/ (tt(X) = a} 


( 1 ) 


where ui(x,a) = P(A = a\X = x ) and /i(x, a) = E(F|A" = x, A = a). Alternate expressions 
for R( 7 r) exist (Zhang et ah, 2012a); however, estimators based on ([Tj) possess a number of 
desirable properties (see below). 

To construct an estimator of i?( 7 r) from (JT|) we replace ui(x, a) and g,(x, a) with esti¬ 
mated working models and replace the expectation with its sample analog. If treatment is 
randomly assigned independently of subject covariates, then co(x, a) can be estimated by 
n -1 I(Ai = a). Otherwise, we posit a multinomial logistic regression model of the form 
uj(x, a) = exp (m t 7 a ) /{1 + YlJ=i ex P (w T 7 j) }, a = 1,..., m — 1, where u = u{x) is a known 
feature vector, and 71 ,..., 7 m _i are unknown parameters. Let cD(x, a) denote the maximum 
likelihood estimator of u>(x,a), where 71 , • • • , 7 m-i are replaced by maximum likelihood es¬ 
timators 71 ,... , 7 m— 1 . We posit a generalized linear model for /x(x, a), g{fi(x,a)} = z T (3a, 
where g(-) is a known link function, z = z(x) is a known feature vector constructed from 
x, and /?i,...,/3 m are unknown parameters. We use /r(x, a) = g^ 1 (z T /3 a ) as our estimator of 
n(x, a), where /3i,..., /3 m are the maximum likelihood estimators of ..., /3 m . 

Given estimators Q(x,a) and g,(x,a), an estimator of R(ir) based on ([TJ) is 


riV'i VV - a) 

.. / j / j ntx n ) 


{Yi - h(Xj, a)} + K x i, d) 


iMXi) = a}. 


( 2 ) 


For any fixed 7r, i?(vr) is doubly robust in the sense that it is a consistent estimator of R(i r) 


if either the model for ui(x,a) or /i(x, a) is correctly specified (Tsiatis, 2006; Zhang et al. 


2012b). As a direct consequence, R(ir) is guaranteed to be consistent in a randomized study, 


as oj(x, a ) is known by design. Furthermore, if both models are correctly specified, then R(ir) 
is semiparametric efficient; i.e., it has the smallest asymptotic variance among the class of 
regular, asymptotically linear estimators (Tsiatis, 2006). 
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2.3 Regimes representable as decision lists 

Gail and Simon (1985) present an early example of a treatment regime using data from the 
NSABP clinical trial. The treatment regime they propose is 


If age < 50 and PR < 10 then chemotherapy alone; 
else chemotherapy with tamoxifen, 

where age (in years) denotes the age of the patient and PR denotes the progesterone receptor 
level (in fmol). The simple if-then structure of the foregoing treatment regime makes it 
immediately interpretable. 

Formally, a treatment regime, 7r, that is representable as a decision list of length L is 
described by {(ci,ai),..., (cl, a/,), a 0 }, where Cj is a logical condition that is true or false 
for each x G M p , and a 3 G A is a recommended treatment, j — 0, ..., L. As a special case, 
L — 0 is allowed. The corresponding treatment regime {ao} gives the same treatment ao to 
every patient. Hereafter, let n denote the set of regimes that are representable as a decision 
list. Clearly, the regime proposed by Gail and Simon (1985) is a member of n. 

Define T(cj) = (i G l p : c 3 is true fora;}, j — 1,..., L; 1Z\ — T(ci), IZj = {D£<jT(q) c } f) 
T(cj), j — 2,... ,L; and 7Z 0 = T(q) c , where S c is the complement of the set S. Then 
a regime 7r G II can be written as 7r(a;) = X^=o a ^ ( x e which has structure 

If ci then ap 
else if C 2 then < 22 ; 

(3) 

else if cl then at,; 
else ao- 

In principle, the conditions c 3 , and hence the sets T(cj), can be arbitrary. To ensure 
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parsimony and interpretability, we restrict Cj so that T(cj) is one of the following sets: 


[1]: 

{x 

G 


: Xjl 

< 

n}, 




[6]: 

{x 

G 

: 

: Xjl 

< 

T\ or 

x j 2 

< 

T 2}, 

[2]: 

(x 

G 


■ x h 

< 

T\ and 

Xj 2 

< 

F 2 } j 

[7] = 

{X 

G 

: 

: x h 

< 

Ti or 

x j 2 

> 

r 2}, 

[3]: 

{x 

G 


: x n 

< 

T\ and 

x n 

> 

r 2>, 

[8]: 

{x 

G 

: 

■ x h 

> 

T\ or 

Xj2 

< 

^2}, 

[4]: 

{x 

G 


: Xjl 

> 

T\ and 

X 32 

< 

T 2}, 

[9]: 

{x 

G 

R p : 

: x h 

> 

Ti or 

JCj 2 

> 

r 2}, 

[5]: 

{x 

G 


: x h 

> 

T\ and 

X h 

> 

r 2>, 

[10]: 

{x 

G 

: 

■ x h 

> 

h}, 





where j\ < j 2 G {1,... ,p} are indices and ti,72 G M. are thresholds. We believe that the 
conditions that dictate the sets in Q, e.g., xy, < T\ and Xj 2 < r 2 , are more easily interpreted 
than those dictated by linear thresholds, e.g., (X\X, n + a 2 Xj 2 < a 3 , as the former are more 
commonly seen in clinical practice. 

In the proposed setup, at most two variables are involved in any single condition. Hav¬ 
ing a small number of variables in each clause yields two important properties. First, the 
resulting treatment regime is parsimonious, and the most important variables for treatment 
selection are automatically identified. Second, application of the treatment regime allows for 
patient measurements to be taken in sequence so that the treatment recommendation can 
be short-circuited. For example, consider a decision list described by {(ci,ai), (c 2 , a 2 ), «o}• 
It is necessary to measure the variables that compose ci on all subjects, but the variables 
composing c 2 need only be measured for those who do not satisfy ci. 

2.3.1 Uniqueness and minimal cost of a decision list 

For a decision list tt described by {(ci, aq),..., (cl,o,l), <h)} 5 let Mg denote the cost of mea¬ 
suring the covariates required to check logical conditions Ci,..., eg. Hereafter, for simplicity, 
we assume that this cost is equal to the number of covariates needed to check ci,..., eg, but 
it can be extended easily to a more complex cost function reflecting risk, burden, and avail¬ 
ability. The expected cost of applying treatment rule ir(x) = J2g=o a ^ ( x e is N(ir) = 
Y^g=\ (X e 'R't) + •A/"l1P > (X G TZq), which is smaller than Ml = Ml ^ ^e)i H ie 


Decision list it: Decision list tt'\ 

If x\ > T\ then ap If x\ < t\ and x2 > T2 then 02; 

else if X2 > T2 then 02 ; else if x\ > t\ then a±; 
else ao- else a o- 

Figure 2: Left: diagram of a decision list dictated by regions IZi = {x G M 2 : X\ > Ti}, 
1Z 2 = {i6l 2 : X\ < Ti, x 2 > r 2 }, and 1Z 0 = {x G M 2 : Xi < T\ , x 2 < t 2 }, and treatment 
recommendations cq, a 2 , and ao- Middle: representation of the decision list that requires 
only x\ in the first clause. Right: alternative representation of the same decision list that 
requires both x\ and X 2 in the first clause. 



cost of measuring all covariates in the treatment regime. This observation reflects the benefit 
of the short-circuit property. 

A decision list 7 r described by {(ci, oq),..., ( cl , cll ), no} need not be unique in that there 
may exist an alternative decision list 7 r' described by {(c^, a\), . .., (c' L ,, a' L ,), a' 0 } such that 
tt(x) = tt'(x) for all x but L 7 ^ L 1 , or L = L' but Cj 7 ^ c'- or cij 7 ^ a'- for some j G 
{1,..., L}. This is potentially important because the expected costs N(tt) and might 

differ substantially. Figure [2] shows two representations, 7r and 7 r', of the same decision list 
both with L = L' = 2 but with different clauses. The cost of the decision list in the middle 
panel, 7 r, is N(tt) = AiP(Xi > Ti)+A/" 2 P(Ad < Ti), whereas the cost of the decision list in the 
right panel, 7 r', is = J\f 2 > N(tt) with strict inequality if J\f 2 > and P(Ad > ri) > 0. 

Thus, 7 r is preferred to ti' in settings where X 2 is a biomarker that is expensive, burdenome, 


or potentially harmful to collect (e.g., Huang et ah, 2015, and references therein). 

Therefore, among all decision lists achieving the value f?,(7r opt ), where 7r opt is an optimal 
regime as defined previously, we seek to estimate the one that minimizes the cost. Defining 
C r to be the level set {77 G n : R{ 7r) = r}, then the goal is to estimate a regime in the 
set argmin 7re £( R ( 7r o P t)j N(tt). Define £(r) = {jt G n : R( n) = r}. Let n be an estimator 
of an element in the set argmax^gn R(tt). In the following we provide an algorithm that 
ensures our estimator, 7 f, belongs to the set argmin^g^r^-^ N(n), where N(tt) is dehned 
by replacing the probabilities in N(n) with sample proportions. 
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2.4 Computation 

Estimation proceeds in two steps: (i) approximate an element i r G argmax^gn i?(vr), where 
R( 7r) is constructed using (J2]): and (ii) find an element n G argmin^g^j^-) j N(n). 


2.4.1 Approximation of argmax^gn-R(vr) 


Maximizing R(n) over tt G II is computationally burdensome in problems with more than 
a handful of covariates because of the indicator functions in (J2]) and the discreteness of the 
decision list. However, the tree structure of decision lists suggests a greedy algorithm in the 


spirit of classification and regression trees (CART, Breiman et al. 1984). Assume that for 
the jth covariate, there is a candidate set of finitely many possible cutoff values Xj. These 
cutoffs might be dictated by clinical guidelines, e.g., if the covariate is a comorbid condition 
then the thresholds might reflect low, moderate, and high levels of impairment; alternatively, 
these cutoffs could be chosen to equal empirical or theoretical percentiles of that covariate. 
There is no restriction imposed on these cutoffs. Let C denote the set of all conditions that 
induce regions of the form in (jd]) with the cutoffs Tj k G X ]k for k — 1, 2, j k G {1,... ,p}. 

Before giving the details of the algorithm, we provide a conceptual overview. The al¬ 
gorithm first uses exhaustive search to find a decision list with exactly one clause, of the 


form 7r = {(ci, ai), a^}, which maximizes R(n). Let {(cy, aq), a[} denote this decision list. 
The algorithm then uses exhaustive search to find the decision list that maximizes R(vr) over 
decision lists with exactly two clauses, the first of which must be either (ci,ai) or (c( ,a\), 
where c) is the negation of C\ such that T(c() = T(ci) c ; e.g., if C\ has the form Xj 1 < iq and 
Xj 2 < 72, then would be x. jx > Ti or Xj 2 > r 2 . Although the decision lists {(ci, cii), 
and {(c^, 0 ^, 01 } yield identical treatment recommendations and have the same value, their 
first clauses are distinct, and may lead to substantially different final decision lists. Hence 
it is necessary to consider both possibilities for the first clause. The algorithm proceeds 
recursively by adding one clause at a time until some stopping criterion (described below) is 
met. 
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Hereafter, for a decision list 7r described by {(ci, ai),..., (cl, a*,), a 0 } for some L > 0, 
write 

R [{(ci, ai),..., ( cl , or,), ao}] to denote R( 7r); e.g., for L = 0, R [{ao}] is the estimated value of 
the regime that assigns treatment a o to all patients. For any decision list with a vacuous con¬ 
dition, e.g., {D e<j T (q) c } f) T(cj) = 0 for some j, define R [{(ci, cp),..., (c L , a L ), a 0 }] = -oo. 
Let z p be the lOOp percentile of the standard normal distribution. Let n tem p denote the set 
of regimes to which additional clauses can be added, and let n final denote the set of regimes 
that have met one of the stopping criteria. The algorithm is as follows, and an illustrative 

example with a step-by-step run of the algorithm is given in the Web Appendix. 

Step 1. Choose a maximum list length L max and a critical level a E (0,1). Compute 

a 0 = arg max ao£ ^ R [{a 0 }]. Set n temp = 0 and n fina i = 0. 

Step 2 . Compute (01,01,0}) = argmax (ci)0l)a / )eCXiAXiA R [{(ci, ai), ai}] and Ai = 

R [{(ci, «i), «i}] - A[{a 0 }]. If Ai < zi- a {Var(Ai) } 1/2 then let n = {o 0 }, set n fina i = {7r}, 
and go to Step 5; otherwise let 7r = {(ci, ai), a}}, 7r' = {(c}, a}), oi}, set n temp = { 71 , 71 '}, and 
proceed to Step 3, where <?, is the negation of c\. 


Step 3 . Pick an element 7r E n temp , say 7r = {(ci, Oi),..., (c,-_ 1, Uj-i), a'-_ 1}, where j — 1 
is the length of 7f. Remove 7f from n te mp- With the clauses (ci, ai),..., (cj_i, %-i) held 
fixed, compute (c^o^o}) = argmax (c . a . a , )6Cx _ 4x _ 4 R [{(cp, Oi),..., (c,-_i, a^), (c,-, aj), a}}] 
and Aj = R [{(ci,Oi),..., (cj_i, a^-i), (cj, a,-), a'}] - A [{(ci,Oi),..., (cj_i, a,-_i), a'.J]. If 
Aj < ^i_ Q ,{Var(A J ) } 1,/2 , then let 7r = {(ci,oi),..., (cj_i, Uj_i), a}„ 1 }, and set Hfi na i = 
nfinai UM; otherwise if j = L max , let 7r = {(ci, ai),..., (c,-_i, %-i), (c}, Oj), a'-}, and set 
n fina i = n fina i lj{7r}; otherwise set n temp = n temp (J{7r, 7 t'}, where c} is the negation of 9, 
7 t { (ci , Oi) ,..., (cj— 1 , Q'j— 1) , (cj , aj ) , C?} , and 7 r { (ci , 01 ),..., (c^_i , Oj_i ) , (c^ , Qj ) , aj } . 


Step 4. Repeat Step 3 until n temp becomes empty. 

Step 5. Compute 7r = argmax^n R(ir). Then 7r is the estimated optimal decision list. 
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The above description is simplified to illustrate the main ideas. The actual implemen¬ 
tation of this algorithm avoids exhaustive searches by pruning the search space C x A x A. 
It also avoids explicit construction of n temp and IIfi na i. Complete implementation details are 
provided in the Web Appendix. In the algorithm, the decision list stops growing if either 
the estimated increment in the value, Aj, is not sufficiently large compared to an estimate 
of its variation, {Var(Aj) } 1//2 , or if it reaches the pre-specified maximal length L max . We 
estimate Var(Aj) using large sample theory; the expression is given in the Web Appendix. 
This variance estimator is a crude approximation, as it ignores uncertainty introduced by 
the estimation of the decision lists; however, it can be computed quickly, and in simulated 
experiments it appears sufficient for use in a stopping criterion. The significance level a 
is a user-chosen tuning parameter. In our simulation experiments, we chose a = 0.05; re¬ 
sults were not sensitive to this choice (see Web Appendix). To avoid lengthy lists, we set 
L max = 10. Nevertheless, in our simulations and applications the estimated lists never reach 
this limit. Finally, it may be desirable in practice to restrict the set of candidate clauses so 
that, for each j, the number of subjects in 7 Zj = {^i<jT (q) c } f) (cj ) exceeds some minimal 
threshold. This can be readily incorporated into the above algorithm by simply discarding 
candidate clauses that induce partitions that contain an insufficient number of observations. 

The time complexity of the proposed algorithm is O [2 imax mp 2 {n + (rnaxj (see 

Web Appendix), where A A)- is the number of cutoff values in X r Because 2 Lmax and m 
are constants that are typically small relative to p 2 [n + (max.,- ^A)) 2 }, the time complexity 
is essentially 0(np 2 ) provided that maxj ^Aj is either fixed or diverges more slowly than 
n 1 / 2 . Hence, the time complexity is the same as a single least squares fit, indicating that the 
proposed algorithm runs very fast and scales well in both dimension p and sample size n. 

2.4.2 Finding an element of arg min 7rg £j^ ( .~. ) j N(n) 

To find an element within the set argmin^^^-^ AT(7 t), we enumerate all regimes in 
£{i?(7r)} with length no larger than L max and select among them the list with the min- 
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imal cost. The enumeration algorithm is recursive and requires a substantial amount of 
bookkeeping; therefore, we describe the basic idea here and defer implementation details to 
the Web Appendix. Suppose 7f is described by {(ci, ai),..., (cl, ol), «o}- Call a condition 
of the form x 3 < Tj an atom. There exist K < 2 L atoms, say di,... ,d,K, such that each 
clause q, ^ = 1,..., L, can be expressed using the union, intersection, and/or negation of at 
most two of these atoms. The algorithm proceeds by generating all lists with clauses rep¬ 
resentable using the foregoing combinations of at most two atoms. To reduce computation 


time, we use a branch-and-bound scheme (Brusco and Stahl, 2006) that avoids construct¬ 
ing lists with vacuous conditions or those that are provably worse than an upper bound on 
min^g^-r^-o In the simulation experiments in the next section, the average runtime 

for the enumeration algorithm was less than one second running on a single core of a 2.3GHz 
AMD Opteron™ processor and 1GB of DDR3 RAM. 


3 Simulation Experiments 

We use a series of simulated experiments to examine the finite sample performance of the 
proposed method. The average value E{i?(7r)} and the average cost E{A^(7 t)} are the pri¬ 
mary performance measures. We consider generative models with (i) binary and continuous 
outcomes; (ii) binary and trinary treatments; (iii) correctly and incorrectly specified models; 
and (iv) low- and high-dimensional covariates. The class of data-generating models that 
we consider is as follows. Covariates are drawn from a p-dimensional Gaussian distribution 
with mean zero and autoregressive covariance matrix such that cov(AA,A0) = 4(l/5)l fc_£ / 
and the treatments are sampled uniformly so that P(A = a\X = x) = 1/m for all x G 
and a G A. Let 0(x, a) be a real-valued function of x and a; given X = x and A = a, 
continuous outcomes are normally distributed with mean 2 + x\ + x 3 + x 5 + X 7 + </>{x, a) and 
variance 1, whereas binary outcomes follows a Bernoulli distribution with success probability 
expit {2 + x\ + X 3 + x$ + X 7 + cj){x , a)}, where expit (u) = exp(w)/{l + exp(w)}. Table [I] lists 
the expressions of 0 used in our generative models and the number of treatments, m, in A. 
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Under these outcome models, the optimal regime is 7r opt (a:) = argmax a a). 

For comparison, we estimate 7r opt by parametric Q-learning, nonparametric Q-learning, 


outcome weighted learning (OWL, Zhao et ah, 2012) and modified covariate approach (MCA, 


Tian et al. 2014). For parametric Q-learning, we use linear regression when Y is continuous 


and logistic regression when Y is binary. The linear component in the regression model 
has the form Y^a= 1 I(A = a)(l, X T )/3 a , where f3\,..., f3 m are unknown coefficient vectors. A 


LASSO penalty (Tibshirani, 1996) is used to reduce overfitting; the amount of penalization is 
chosen by minimizing 10-fold cross-validated prediction error. For nonparametric Q-learning, 
we use support vector regression when Y is continuous and support vector machines when 


Y is binary (Zhao et ah, 2011), both are implemented using a Gaussian kernel. Tuning 
parameters for non-parametric Q-learning are selected by minimizing 10-fold cross-validated 
prediction error. For OWL, both linear and Gaussian kernels are used and we follow the 


same tuning strategy as in Zhao et al. (2012). For MCA, we incorporate the efficiency 


augmentation term described in Tian et al. (2014). Both OWL and MCA are limited to two 
treatment options. 

To implement our method, the mean model, /z(x, a), in (JTj) , is estimated as in parametric 
Q-learning. The propensity score co(x,a) is estimated by ^ _1 AA = a )• All the 
comparison methods result in treatment regimes that are more difficult to interpret than a 
decision list; thus, our intent is to show that decision lists are competitive in terms of the 
achieved value of the estimated regime, E{i?(7r)}, while being significantly more interpretable 
and less costly. 

Results in Tableware based on the average over 1000 Monte Carlo replications with data 
sets of size n = 500 if m — 2 and Y is continuous; n = 750 if m = 3 and Y is continuous; 
n = 1000 if m — 2 and Y is binary; and n = 1500 if m — 3 and Y is binary. The value R(ji) 
and cost N(tt) were computed using an independent test set of size 10 6 . 

Table [2] shows that the decision list is competitive in terms of the value obtained across 
the entire suite of simulation experiments. If 7r opt can be represented as a decision list, the 
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Table 1: The second column gives the number of treatment options m. The third column 
gives the set of (j) functions used in the outcome models. The fourth column specifies the form 
of the optimal regime 7r opt (x) = argmax a a) where: “linear” indicates that 7r opt (x) = 
argmax a {(l, x T )/3 a } for some coefficient vectors j3 a £ M p+1 , a £ A; “decision list” indicates 
that 7r opt is representable as a decision list; and “nonlinear” indicates that 7r opt (x) is neither 
linear nor representable as a decision list. 


Setting 

m 


Expression of 4> 

Form of 7r opt 

I 

2 

* —i 

II 

= 2){3/(xi < 1,® 2 > -0.6) - 1} 

decision list 

II 

2 

<fa{x, a) = I (a 

= 2)(xi + x 2 - 1) 

linear 

III 

2 

1 —i 

II 

VT 

= 2) arctan(exp(l + aq) — 3x2 — 5) 

nonlinear 

IV 

2 

</> 4 (x, a) = I(a 

= 2)(xi - x 2 + x 3 - x 4 ) 

linear 

V 

3 

4>5(x, a) = I(a = 2){4/(xi > 1) - 2} 

+ I(a = 3)1 (x! < 1){2 1(x 2 < -0.3) - 1} 

decision list 

VI 

3 

4> 6 (x,a) = I (a 

= 2)(2xi) + I(a = 3)(-xix 2 ) 

nonlinear 

VII 

3 

*—i 

II 

= 2)(xi - x 2 ) + I(a = 3)(x 3 - x 4 ) 

linear 


proposed method produces the best value. However, even in settings in which the optimal 
regime is not a decision list, the estimated decision list appears to perform well. Recall 
that the proposed algorithm attempts to find the best approximation of the optimal regime 
within the class of regimes that are representable as a decision list. Figure [3] shows the 
average estimated decision list in misspecified settings 11 and 111 with continuous outcome 
and p — 10. In these settings, the estimated decision list provides a reasonable approximation 
of the true optimal regime. In addition, the cost of the decision list is notably smaller than 
the cost of the parametric Q-learning estimator or the MCA estimator. Nonparametric 
Q-learning OWL always use all covariates, so their costs are always equal to p. 

In the Web Appendix, we derive point estimates and prediction intervals for R(tt). We 
also present simulation results to illustrate the accuracy of variable selection for the decision 
list. 
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Table 2: The average value and the average cost of estimated regimes in simulated exper¬ 
iments. In the header, p is the dimension of patient covariates; DL refers to the proposed 
method using decision list; Q± refers to parametric Q-learning; Q 2 refers to nonparametric 
Q-learning; OWLi and OWL 2 refer to outcome weighted learning with linear kernel and 
Gaussian kernel, respectively; MCA refers to modified covariate approach with efficiency 


augmentation. OWL and MCA are not applicable under Setting V, VI and VII. 


p Setting 




Value 




Cost 


DL 

Qi 

Q 2 

OWLi 

owl 2 

MCA 

DL 

Qi 

MCA 

Continuous response 

I 

2.78 

2.53 

2.53 

2.33 

2.29 

2.54 

1.64 

9.0 

5.1 

II 

2.70 

2.80 

2.79 

2.61 

2.54 

2.80 

1.64 

9.0 

5.1 

III 

2.59 

2.54 

2.53 

2.29 

2.24 

2.55 

1.68 

9.1 

4.9 

10 IV 

2.89 

3.37 

3.35 

3.16 

3.09 

3.37 

2.50 

9.5 

7.4 

V 

2.90 

2.67 

2.59 

— 

— 

— 

1.90 

9.5 

— 

VI 

3.98 

3.46 

3.95 

— 

— 

— 

1.61 

9.2 

— 

VII 

3.22 

3.75 

3.73 

- 

- 

- 

2.56 

9.7 

- 

I 

2.76 

2.51 

2.36 

2.21 

2.19 

2.53 

1.80 

21.3 

9.2 

II 

2.70 

2.79 

2.73 

2.26 

2.27 

2.79 

1.64 

21.4 

9.3 

III 

2.59 

2.52 

2.35 

2.16 

2.12 

2.54 

1.71 

23.1 

9.0 

50 IV 

2.89 

3.36 

3.27 

2.76 

2.70 

3.36 

2.53 

25.4 

14.9 

V 

2.87 

2.63 

2.33 

— 

— 

— 

2.14 

28.5 

— 

VI 

3.95 

3.43 

3.47 

— 

— 

— 

1.69 

26.6 

— 

VII 

3.21 

3.74 

3.61 

- 

- 

- 

2.55 

30.8 

- 

Binary response 

I 

0.77 

0.74 

0.69 

0.73 

0.73 

0.74 

1.94 

8.9 

4.1 

II 

0.71 

0.72 

0.60 

0.71 

0.71 

0.72 

1.69 

9.2 

5.3 

III 

0.73 

0.73 

0.68 

0.72 

0.72 

0.73 

2.10 

9.2 

4.7 

10 IV 

0.71 

0.76 

0.66 

0.75 

0.74 

0.75 

2.40 

9.6 

8.4 

V 

0.75 

0.73 

0.62 

— 

— 

— 

2.52 

9.6 

— 

VI 

0.79 

0.75 

0.64 

— 

— 

— 

2.09 

9.5 

— 

VII 

0.77 

0.81 

0.69 

- 

- 

- 

2.83 

9.9 

- 

I 

0.76 

0.73 

0.69 

0.71 

0.70 

0.73 

2.64 

21.9 

8.3 

II 

0.71 

0.72 

0.60 

0.70 

0.69 

0.71 

1.87 

26.2 

6.4 

III 

0.73 

0.72 

0.67 

0.70 

0.69 

0.72 

2.53 

25.0 

7.3 

50 IV 

0.71 

0.76 

0.66 

0.73 

0.72 

0.74 

2.55 

31.0 

13.8 

V 

0.74 

0.72 

0.61 

— 

— 

— 

3.15 

30.4 

— 

VI 

0.78 

0.75 

0.63 

— 

— 

— 

2.41 

29.6 

— 

VII 

0.76 

0.81 

0.68 

- 

- 

- 

2.97 

35.7 

- 
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Figure 3: Left: average estimated regimes under setting II. Right: average estimated 
regimes under setting III. In both settings 7T opt cannot be represented as decision list. The 
solid line is the treatment decision boundary under 7r opt . The region where treatment 1 is 
better than treatment 2 is marked by circles, while the region where treatment 2 is better 
than treatment 1 is marked by crosses. For every point (xi, X 2 ) T , we compute the proportion 
of 1000 replications that the estimated regime recommends treatment 1 to a patient with 
covariate (xi,X 2 , 0,..., 0) G M 10 . The larger the proportion, the darker the shade. 
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4 Applications 


4.1 Breast Cancer Data 


Gail and Simon (1985) compared the treatment effects of chemotherapy alone and chemother¬ 


apy with tamoxifen using data collected from the NSABP trial. Their regime recommended 
chemotherapy alone to patients with age < 50 and PR < 10 and chemotherapy plus ta¬ 
moxifen to all others. Because the variables involved in the treatment regime constructed 
by Gail and Simon were chosen using clinical judgment, it is of interest to see what regime 
emerges from a more data-driven procedure. Thus, we use the proposed method to estimate 
an optimal treatment regime in the form of a decision list using data from the NSABP trial. 


As in Gail and Simon (1985), we take three-year disease-free survival as the outcome, 


so that Y — 1 if the subject survived disease-free for three years after treatment, and 
Y = 0 otherwise. Patient covariates are age (years), PR (frnol), estrogen receptor level 
(ER, fmol), tumor size (centimeters), and number of histologically positive nodes (num¬ 
ber of nodes, integer). We estimated the optimal treatment regime representable as a 
decision list using data from the 1164 subjects with complete observations on these vari¬ 
ables. Because treatment assignment was randomized in NSABP, we estimated c u(x, a) 
by the sample proportion of subjects receiving treatment a. Based on exploratory anal¬ 
yses, we estimated n(x, a) using a logistic regression model with transformed predictors 
z = z(x) = {age, log(l + PR), log(l + ER), tumor-size, log(l + number-of-nodes)} T . 

The estimated optimal treatment regime representable as a decision list is given in the 
top panel of Figure |4j the regime estimated by Gail and Simon is given in the bottom panel 
of this figure. The structure of the two treatment regimes is markedly similar. The treatment 
recommendations from the two regimes agree for 92% of the patients in the NSABP data. 
In this data set, 33% of the patients have a PR value less than 3; 13% of the patients have 
a PR values between 3 and 10; and 54% of the patients have a PR value greater than 10. 


In a previous analysis of the NSABP data, Zhang et al. (2012b) recommended that 
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Figure 4: Top: estimated optimal treatment regime representable as a decision list. Bot¬ 
tom: treatment regime proposed by Gail and Simon (1985). 


19 

























patients with age + 7.981og(l + PR) < 60 receive chemotherapy alone and all others re¬ 
ceive chemotherapy plus tamoxifen. However, this regime was built using only age and 
PR as potential predictors with no data-driven variable selection. In contrast, the pro¬ 
posed method selects age and PR from the list of potential predictors. For completeness, 
we also implemented parametric Q-learning using a logistic regression model with covari¬ 
ate vector z. The estimated regime recommends chemotherapy alone if 1.674 — 0.021 age — 
0.076 log(1 +PR) — 0.116 log(l + ER) — 0.024 tumor-size —0.274 log(l + number-of-nodes) > 0 
and chemotherapy with tamoxifen otherwise. The treatment recommendation dictated by 
parametric Q-learning agrees with that dictated by decision list for 86% of the subjects in 
the data set. 

To estimate the survival probability under each estimated regime, we use cross-validation. 
The data set was randomly divided into a training set containing 80% of the subjects and 
a test set containing 20% of the subjects. The optimal regime was estimated using both 
approaches on the training set, and its value was computed using (J2]) (with /2 = 0) on 
the test set. To reduce variability, this process was repeated 100 times. The estimated 
survival probability is 0.65 for the regime representable as decision list and 0.66 for the 
regime obtained from parametric Q-learning. Thus, the proposed method greatly improves 
interpretability while preserving quality. 


4.2 Chronic Depression Data 


Keller et al. (2000) compared nefazodone, psychotherapy, and combination of nefazodone 


and psychotherapy for treating patients with chronic depression in a three-arm randomized 
clinical trial. Among the three treatments considered, combination therapy was shown to 
be the most beneficial in terms of efficacy as measured by the Hamilton Rating Scale for 
Depression score (HRSD). However, the combination treatment is significantly more expen¬ 
sive and burdensome than monotherapy. Therefore, it is of interest to construct a treatment 
regime that recommends combination therapy only to subjects for whom there is a significant 
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benefit over monotherapy. 

Because lower HRSD indicates less severe symptoms, we define outcome Y = — HRSD 
to be consistent with our paradigm of maximizing the mean outcome. Patient covariates 
comprise 50 pretreatment variables, including personal habits and difficulties, medication 
history and various scores from several psychological questionnaires; a list of these variables 
is given in the Web Appendix. We estimate an optimal regime using data from the n = 647 
(of 680 enrolled) subjects in the clinical trial with complete data. Because treatments were 
randomly assigned, we estimated u>(x, a) by the sample proportion of subjects receiving 
treatment a. We estimated fi(x, a) using a penalized linear regression model with all patient 
covariates and treatment by covariate interactions. Penalization was implemented with a 
LASSO penalty tuned using 10-fold cross-validated prediction error. 

The estimated optimal treatment regime representable as a decision list is displayed in 
Figure [lj One explanation for this rule is as follows. Those with strong physical anxiety 
symptoms (somatic) and significant cognitive impairment (retardation) may be unlikely to 
benefit from psychotherapy alone or in combination with nefazodone and are therefore rec¬ 
ommended to nefazodone alone. Otherwise, because psychotherapy is a primary tool for 
treating anxiety (HAM-A) and nefazodone is associated with sleep disturbance (sleep), it 
may be best to assign subjects with moderate to severe anxiety and severe sleep disturbance 
to psychotherapy alone. All others are assigned to the combination therapy. 

The estimated regime contains only four covariates. In contrast, the regime estimated by 
parametric Q-learning using linear regression and LASSO penalty involves a linear combi¬ 
nation of twenty-four covariates, making it difficult to explain and expensive to implement. 
To compare the quality of these two regimes, we use random-split cross-validation as in 
Section 4.1. The estimated HRSD score under the regime representable as decision list is 
12.9, while that under the regime estimated by parametric Q-learning is 11.8. Therefore, by 
using decision lists we are able to obtain a remarkably more parsimonious regime with high 
quality, which facilitates easier interpretation. 
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5 Discussion 


Data-driven treatment regimes have the potential to improve patient outcomes and generate 
new clinical hypotheses. Estimation of an optimal treatment regime is typically conducted as 
a secondary, exploratory analysis aimed at building knowledge and informing future clinical 
research. Thus, it is important that methodological developments are designed to fit this 
exploratory role. Decision lists are a simple yet powerful tool for estimation of interpretable 
treatment regimes from observational or experimental data. Because decision lists can be 
immediately interpreted, clinical scientists can focus on the scientific validity of the estimated 
treatment regime. This allows the communications between the statistician and clinical 
collaborators to focus on the science rather than the technical details of a statistical model. 

Due to the “if-then” format and the conditions given in Q, the estimated regime, as a 
function of the data, is discrete. Thus, a theoretical proof of the consistency of the treatment 
recommendations using decision lists is heavily technical and will be presented elsewhere. 
We provide some empirical evidence in the Web Appendix that the estimated regime gives 
consistent treatment decisions. 
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Web Appendix for “Using decision lists to construct interpretable 


and parsimonious treatment regimes” 


A.l An Illustrative Run Through the Algorithm for 
Finding an Optimal Decision List 

In this section, we illustrate how the proposed algorithm for finding an optimal decision list 
works. For simplicity, the patient covariate is assumed to be two-dimensional. 

• The algorithm starts at Step 1. 

— We choose L max = 5 and a = 0.05. 

— We compute a-o = arg max ao£ ^ R [{a-o}]- Suppose the maximum found is R [{ao}] = 
10. Figure [5] shows the decision list {ao}. 

- We set fl temp = 0 and fl final = 0. 


• The algorithm proceeds to Step 2. 


— The goal is to estimate the first clause (ci,ai). 

— We compute (c\ ,hi, a \) = argmaX( cli01i0 /) eCx _ 4 Xv4 R [{(ci, ap, a}}]- This is done, 
conceptually, by evaluating R(-) at each element in C x A x A. Suppose the 


X2 


ao 


Everyone ao- 


X\ 


Figure 5: Diagram and description of the decision list {a 0 }. 


27 





























































X2 


If xi < n then ai; 
else a'p 


Figure 6: Diagram and description of the decision list {(ci, ax), 'a!{\. 

maximum found is R [{(ci, ai), a',}] = 15 and the clause Ci has the form x\ < T\. 
Figure [6] shows the decision list { (ci, di), a \}. 

— We compute = R [{(ci, ax), ai)] —-R [{a 0 }] and compare Ax to 2 x_ Q {Var(Ax) } 1//2 . 
In this case Ax = 15 — 10 = 5. Suppose we get Var(Ax) = 4 after calculations. 
Since 5 > 20.95 x 4 1//2 , we add two decision lists, {(ci, ax), ai} and {(c^, a}), ax}, 
into the set n tcmp and proceed to estimate the second clause (c 2 ,a 2 ). 

— We make a remark on non-uniqueness here. The decision list {(ci, Ox), ai} can be 
equivalently expressed as {(c^, a'x), ax}, where c} is the negation of ci. Since these 
two decision lists provide the same treatment recommendation to every patient, 
we have _R[{(c}, a[), ax}] = i?[{(cx, ax), a'x}] = 15. However, their hrst clauses are 
different and may lead to considerably different final decision lists. Currently it 
is impossible to determine whether (cx,ax) or (c'xjOx) should be used in the first 
clause. Thus we add both decision lists into n temp , and move on to building the 
second clause while keeping in mind that there are two possibilities, (ci,ax) and 
(cx,a'x), for the first clause. Figure [7] shows the decision list {(c^, a 7 ,), Gi}. The 
diagram is the same as in Figure [6] while the description is different. 

• The algorithm proceeds to Step 3. 

— We pick an element W from n temp . Currently n temp contains two decision lists: 
{(ci, ax), a'x} and Suppose we get W = {(ci, ax), a'x}. We remove W 
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If x'i > n then o/j; 
else ai. 


^2 



Figure 7: Diagram and description of the decision list {(c^, a^), ai}. 



Figure 8 : Diagram and description of the decision list {(ci, cp), (c 2 , a 2 ), a' 2 }. It is possible 
that a 2 = a\ or a 2 = di . 


from n temp . 

- We compute (c 2 ,d 2 ,d' 2 ) = argmax (c2)a2) ^ )eCx ^ XiA ^ [{(ci, di), (c 2 , a 2 ), a' 2 }]. During 
the maximization (d], d i) is held fixed. Intuitively, this is to partition T(ci) c while 
keeping T(ci) fixed. Suppose the maximum found is i?[{(ci,di), (d 2 ,d 2 ),d 2 }] = 
16 and the clause d 2 has the form x 2 < r 2 1 . Figure [8] shows the decision list 
{(Cl, Ol); (c 2 , ^ 2 ); ® 2 }' 

— We compute A 2 — R [{(ci,di), (c 2 ,d 2 ),d 2 }] — i? [{(ci,di),di}] and compare A 2 to 
Zi_ a {Var(A 2 ) j- 1 / 2 . In this case A 2 = 16 — 15 = 1. Suppose we get Var(A 2 ) = 2.25 
after calculations. Since A 2 < Zo. 95 {Var(A 2 ) j 1 ^, the simpler, more parsimonious 
decision list {(A, di), a\ } is preferred and added to IIfi na i, while {(ci,di), (c 2 ,d 2 ),d 2 } 
is discarded. 

• The algorithm repeats Step 3. 
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If xi > 7 ~i then a\ ] 
else if X2 < to 2 then 02; 
else a 2 . 


Figure 9: Diagram and description of the decision list {(c^, a'i), ( 02 , 02 ), D is possible 
that 02 = o'] or a 2 = a 7 ,. 


— Step 3 is repeated since n temp contains another element tt = {(2^, a[), ai}. We 
remove n from n temp . 

— We compute (c 2 ,a 2 ,a 2 ) = argmax (C2 a2 ^ )&C xAxA^ [{(^i, «!), ( c 2 , a 2 ), a' 2 }]- During 

the maximization (2^, 2^) is held fixed. Intuitively, this is to partition T{c \) while 
keeping T(2i) c fixed. Suppose the maximum found is 2^), (c 2 , 02 ), a 2 })] — 

18 and the clause c 2 has the form x 2 < r 2 2 - Figure [9] shows the decision list 

(c2,a 2 ),a' 2 }. 

— We compute A 2 = R [{(c^, «i), ( 02 , < 22 ), a^}] — -R [{(c^ , 2^), 2q}] and compare A 2 to 
^i_ Q {Var(A 2 ) } 1 ' / ~. In this case A 2 = 18 — 15 = 3. Suppose we get Var(A 2 ) = 2 
after calculations. Then we have A 2 > £o. 95 {Var(A 2 ) j^ 2 , which means that the 
second clause significantly improves the performance of the decision list. Thus we 
add decision lists {(<?,, a 7 ,), (S 2 , 02 ), a 2 } and {(2^, 2^), (c^, a 2 ), a 2 } to Il temp . 

— Here the non-uniqueness comes into play again. Consequently, although the de¬ 
cision lists {(2^,2^), (c 2 ,a 2 ),a 2 } and {(2^,2^), (2 7 2 ,a 2 ),a 2 } are equivalent, it is im¬ 
portant to have both of them added to n temp . 


• The algorithm repeats Step 3. 

— Now n te mp contains two decision lists while Hfi na i contains one. Thus Step 3 is re¬ 
peated. We first pick and remove an element tt from n temp , say 7f = {(2^, a[), (c 2 , a 2 ), a 2 }. 
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If x\ > T\ then a \; 
else if X2 < r 2 2 then 02; 
else if xi < r3i then a 3; 
else 03. 

Figure 10: Diagram and description of the decision list {(c^a^), ( 02 , 02 ), (c 3 ,a 3 ),a 3 }. Some 
of the values of a^, a 2 , 03,03 can be equal. 

— Next, we will build a decision list of length 3 and the first two clauses being (c' 1 ,a' 1 ) 

and (c 2 , o 2 ). We compute (c 3 ,a 3 ,a' 3 ) = arg max (c3 a3 a , )eCx _ 4x ^ 4 R [{(c^, a[), (c 2 , o 2 ), (c 3 , o 3 ), a£}]. 
During the maximization (c^a^) and (c 2 ,a 2 ) are held hxed. Suppose the maxi¬ 
mum found is £[{(ci,ai), (c 2 , a 2 ), (03,03), 03}] = 20 and the clause C3 has the form 
x 3 < T31. Figure [To] shows the decision list {(0^,0^), (c 2 ,a 2 ), (c 3 ,a 3 ),a3}. 

- We then compute A 3 = R [{(<?,, a',), (c 2 , o 2 ), (c 3 , a 3 ), a' 3 }]-^ [{(?,, o',), (c 2 , a 2 ), a' 2 }] 
and compare A 3 to £i_ Q {Var(A 3 ) In this case A 3 = 20 — 18 = 2. Suppose we 
get Var(A 3 ) = 3 after calculations. Then we have A 3 < -o. 95 {Var(A 3 )} 1/ T Thus 
the simpler, more parsimonious, decision list {(c^, a^), (c 2 , a 2 ), a 2 } is preferred. So 
we add {(c' 1 ,a , 1 ), (c 2 ,o 2 ), o' 2 } to fl final and drop {(c 7 1 ,a / 1 ), (c 2 ,a 2 ), (c 3 , a 3 ), a' 3 }. 


^2 



The algorithm repeats Step 3. 


— Since n temp contains one element 7 r = {(c^, a \), (H,, a 2 ), a 2 }, we repeat Step 3 once 
again. We remove 7f from n temp . 

- We compute (c 3 ,a 3 ,a' 3 ) = argmax (c3a3ta , 3)eCxAxA R [{(%,£[), (? 2 ,a' 2 ), (c 3 , a 3 ), a' 3 j] 
while keeping (c[, a^) and (c^, a 2 ) hxed. Suppose R{{(S l: 'd' l ), (c^, a 2 ), (c 3 ,a 3 ),a 3 }) = 


20.5 and the clause C 3 has the form x\ < r 3 2 and x 2 > r 33 . Figure |11| shows the 
decision list {(c^, a[), (c^, a 2 ), (c 3 , a 3 ), a' 3 }. 

— We compute A 3 = R [{(c^, a'J, (c^, a 2 ), (S 3 , a 3 ), a 3 }] — R [{(c^, a'J, (c^, a 2 ), a 2 }] and 
compare A 3 to zi_ Q {Var(A 3 ) j 1 ^. In this case Ai = 20.5 — 18 = 2.5. Suppose 
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If x\ > n then a \; 

else if X2 > T22 then a 2 ; 

else if xi < T32 and X2 > T33 then S3; 

else S3. 


Figure 11: Diagram and description of the decision list {(c), a^), (<? 2 , a 2 ), ( 03 , 03 ), a 3 }. Some 
of the values of a^, a 2 , S 3 , a 3 can be equal. 

we get Var(Ai) = 2.56 after calculations. Then we have A 3 < z 0 . 95 {Var(A 3 ) j- 1 ^. 

So the simpler, more parsimonious, decision list {(S',, a)), (c^a^),^} is preferred. 
Consequently, we add {(c^, a^), (S 2 ,S 2 ), a 2 } to IIfi na i and discard {(c^, a^), (S 2 ,S 2 ), (S3,S3), a 3 }. 

• The algorithm finishes Step 4, because n temp contains no element now. 

• The algorithm proceeds to Step 5. 

— We would like to pick a decision list from IIfi na i that maximizes R(-). 

— I 11 this example, we have three decision lists in IIfi na i: {(Si,Si), a^} with estimated 
value 15, {(S 1 ,S , 1 ), (S 2 ,S 2 ),S 2 } with estimated value 18, and {(S 1 ,S / 1 ), (S 2 ,S 2 ),S 2 } 
with estimated value 18. 

— We then choose the one with the maximal estimated value (with ties broken 
using the hrst encountered). Therefore, the estimated optimal decision list n is 
described by {(S,,S(), (S 2 ,S 2 ),S 2 |, as shown in Figure [9j 

A.2 Asymptotic Properties of R(ir) for a Given n 

We shall derive some asymptotic properties of the doubly robust estimator R(n) introduced 
in Section 2.2 in the main paper. In the next section, we will use these properties to derive 
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an estimator for Var {i^Tiy) — i?( 7 r 2 )}, which is used by our proposed algorithm for finding 
an optimal decision list. 

Hereafter denote the observed data for the zth subject by O* = (A 7 "’’, A i: Yj) T . 

We first derive an i.i.d. representation of 7 = ( 7 ^,..., ' 7 ^_ 1 ) T , the maximum likelihood 
estimator of 7 = ( 7 ^,... , r Ym-i) T i n the multinomial logistic regression model: 


P(H = a\X = x) = exp (u T 7 a ) j <|l + exp {u Tr fj) 

If u = u(x) = 1, then the maximum likelihood estimator of c <j(x, a) = P(H = a\X = x) 
reduces to E n I(A = a). Thus the multinomial logistic regression model includes the sample 
proportion as its special case. The log-likelihood function is 




n 


i=l 


E 

i =1 


m— 1 


m— 1 


E 1(Ai = a)U?la - log < 1 + ex P(^ 7 a) 

a=1 ^ a=1 

m— 1 ( m—1 

E 1(Ai = a)t/Al > a 7 - log 1 + ex P(^^a7) 


a=1 


a=1 


qxq 


where Ui = u(Xi), q is the dimension of U u and $1 = (/ g | 0 gx ( m _ 2 ) 9 ), $2 = (0 
0gx(m-3)g) j • ■ ■ j 1 = (Oqx(m — 2 )q | Iq) are (m — 1) matrices of size q x (m — 1 )q satisfying 


<h a7 = 7 a . Hence we have 


Alt ( 7 ) 
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dH t { 7 ) 
d^d^ T 


m—1 


EE 
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E E ^=<o*w - 


1 + EE exp(£/74. o7 ) 


i =1 ^ a=l 
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1 + EE exp((7-4 a 7) 


n 
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i =1 

n rv^m-1 


{EE exp(f/’'*„ 7 ) J>{ty {EE exp(f/’'*„ 7 )( 7 -J> 0 } 
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(5) 


Denote 70 as the maximizer of Ef t ( 7 ). By the likelihood theory, we have 


V^(7-7o) = -Vn 


E 


r <9 2 £ t (7, 


7 -1 


flit (70 
<97 




[ d r '/d r '/ 1 

where the partial derivatives are given in (|5]) , and o p (l) denotes a random quantity that 
convergences to zero in probability. Define 


<P 7 (0«) = - < E 


fl 2 lt(7o 

d'yd'y' 1 


} — 1 f m—1 

| E ^ = “)*k - 


ETf ^P(C^ a 7 o)»l-y. 

1 + EE exp(C/l«> a 7o) 
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Then we have 


Vn{ 7 - 7o) = E vH°i) + °p( 1 )- 


i= 1 


Next we derive an i.i.d. representation of 0 — (04---,0m) T > the maximum likelihood 
estimator of j3 — (04..., 04) T in the generalized linear model: 

m 

g {E(Yj|Xj, Ai)} = HA = a)ZlA- 

a= 1 

We assume that Y t given Ai and X,; has an distribution in the exponential family with density 
function 


fy-M = ex P 


y-fii - b(&i 


+ c{y u 0 ) \ , 


where 6 t and 0 are parameters, and &(•) and c(-, •) are known functions. Note that for 
normal distribution 0 is known as the dispersion parameter while for Bernoulli distribution 
0 is always equal to one. For simplicity we assume g(-) is a canonical link function hereafter. 
Then we have &'(•) = # _1 (-) and 4 = E^Li HA = a)ZJ/3 a - The log-likelihood function is 


2=1 
n 

E 


1 

n 

1 


'Yi Er=i HA = o)zjA - b (Er=i HA = a)zjA} 


+ c (Yi, 0 ) 


2=1 L 


'Yi HA = a)z?* a (3 - b (EI n =1 1(A = a,)Z^ a (3} 


+ c(Yi, 0) 


where r is the dimension of Z h and 4b = (l q | 0 qx ( m -i) q ), 4b = (0 qxq \ I q | 0 qx (m- 2 ) q ), • • ■, 
= (O qX (m-i)q I H) are m matrices of size q x mq satisfying 440 = 0 a . Then we have 


04(0,0) 

00 

0 2 4(0,0) 

0000 T 
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—E 

neb 

r i=i 
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^ 2=1 
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By the property of the score function, we have 
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0 2 4(0o,0o) 0 = _ 1 F 

0000 


04(00, 0Q ) 
00 


= 0 . 
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Therefore, by the likelihood theory and the property of block diagonal matrix, we conclude 
that 


where 


Vn 0 ~ A)) 



dpdp* J 
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\fn 


y <pp{Oi) +o p (i), 

1 


dio(Po, 00) 
d/3 


+ °p(1) 


<Pp(Oi) = E 


. a=l 


-1 


b" [yi(A t = a)Zj* a /3 oj a)*lZiZ?*a 

Yi - V £ HA = ajZJfyaPo U HA = aJ'flZi 1 • 


. a= l 


a =1 


Finally we derive an i.i.d. representation of R( n). To emphasize the dependence of uj(x, a ) 
and /i(x,a) on the parameters 7 and (3, in the following we write u(x,a) as co(x,a, 7) and 
ji{x,a) as iJ,(x,a,/3). Thus we have u(x,a) = u(x,a, 7) and Jl(x,a ) = /j,(x,a, (3). Note that 


w(a;,a, 7 ) 
p.(a:, a, /3) 


exp(w T <l> a 7) 
E”Li exp(u T $ i7 ) ’ 

b\z T ^ a (3), 


for a = 1,..., m, where <f> m = 0 gx ( m _i) g . Hence we have 


du(x, a, 7) 
c>7 

dfi(x, a, (3 ) 
d/3 


exp(u T $ a 7) {E™ 1 exp^/'H^/) • ($£ - $J)u} 

{EJIi exp(V r T j7 )} 
b"(z^ a (3)^ a z. 
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By Taylor expansion, we have 
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Yi - n(Xi, a, ( 3 ) \ + / i(X u a, ( 3 ) 


I{n(Xi) = a} 


n 


n m r 

EE 

2=1 a=l 
^ n m 

n 

i 

n 


/(A = a) 
u{Xi,a, 70) 

/(A,: = a) 


n m r 


S£l- ^ 2 (A'i,a,7o) 
-[(A = a) 

cu(Xi,a,7o) 
/(A = a) 


{Ej — /i(Tj, a, A)} + //(At, <T A,) 

{E-p(Al,aAo)}/{vr(Al) = a} 


/{tt(A) = a} 
du>(Xi, a, 70) 


^7 


(7 - 7o) 


EE 

2=1 a=l 
n m r 

EE 

2=1 < 2=1 

t m 

E 

a=l 
/ m 

ME 


+ 1 ^ /{M'.Y.j = a} 


dfjL(Xi,a,p 0 ) 


dp 


(M ~ Po) + o p (l) 


w(Xi, a, 70) 
/(A = a) 


L w 2 (^i,a,7o) 
-/’(A = a) 


{Ej - p(Aj, a, Mo)} + //(A*, a, Mo) 
{Y i - f i(X i ,a,p 0 )}I{Tr(X i ) = a} 


I{n( X t ) = a} 
dui(Xii a, 70) 


^7 


(7 - 7o) 


v. a=l L 


Recall that 
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where dco/d'y and d[i/dp are given in (J 6 ]) . Then we have 
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Therefore, by the central limit theorem and the Slutsky’s theorem, we conclude that 

{-R(tt) - i?(7r)} 4 iV(0, E K(0*)}), (8) 

where 4 denotes convergence in distribution. 

To estimate the asymptotic variance, we use the plug-in method. Namely, define ) 

as in ([7]) except that expectations are replaced with sample averages and true values are re¬ 
placed with corresponding estimates. Then Var ( R(n) j can be estimated by fi 2 R(Oi)/n 2 . 


A.3 Asymptotic Properties of R(n\) — R( 772 ) 

Dehne tpRi(Oi) as in ([?]) with n replaced by n\. Dehne <p_R 2 (Oj) as in ([?]) with n replaced by 
7T2- Dehne (f>Ri(Oi ) and (pR 2 {Oi ) similarly. Then we have 


- R(n 2 ) 


{^(TrO - i?(7T 2 )} 



^ Wm(Oi) _ PraiOi)} + o p (l) 

i =1 


4 1V(0, E {(pRl(Oi) — (fR2(Oi)} 2 ). 


Therefore, we can estimate Var |f?(7r 1 ) — f?(7r 2 )| by 

1 n 

Var |f?(vr 1 ) - ^(tt 2 )| = ^ {<fRi(Oi) - ^i? 2 (<A)} 2 . (9) 

2=1 

The variance estimator Var(Aj) used in the algorithm in Section 2.4.1 in the main 
paper can be obtained via @ with H\ = {(ci, ai),..., (cj_i, Uj-i), {cj, aj), a' } and tt 2 = 

{ (Cl, Ol), • • • , (Cj — i, CLj — i), O’j—i } • 

A.4 Implementation Details of Finding an Optimal De¬ 
cision List 


A.4.1 Algorithm Description 

We give an equivalent version of the proposed algorithm for hireling an optimal decision list. 
Compared to the algorithm presented in the main paper, this version makes use of recursive 
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calls to avoid explicit constructions of sets n temp and II fina i, and facilitates the analysis of 
time complexity. The algorithm is as follows. 

Input: R(-), L max , a 

Output: a decision list tv that maximize R(-) 
a 0 = &Ygmax aoeA R [{a 0 }]; 
tx = FindList(l, {}, ao); 

The function FindList is defined below. When j = 1, we treat (ci,Oi),..., i) 

as an empty array. Thus when j = 1, {(ci,ai),..., (cj-i, %-i), (cj, a?), a' } is the same as 
{(ci,ai),ai} and {(hi, ai),..., (cj_i, aj_i), a'_ 1 } is the same as {a' 0 }. 

In the FindList function, a crucial step is to compute (’ Cj,a,j,a'j ). A straightforward 
implementation that involves a brute-force search over C x A x A can be time consuming. 
We provide an efficient implementation below. 

We observe that some calculations can be performed only once at the beginning of the 
algorithm. First, define 

iia = — j ^- % Wi ~n(Xi,a,p)\ + n(Xi, a, 0). 

w{Xi,a, 7 ) l J 

Then we have 

n m 

A(tt) = - £iJ{n{Xi) = a}. 

i= 1 a =1 

Second, for the ith subject, denote x l3 as the observed value of his/her jth covariate. For the 
jth baseline covariate, there are sy, = possible candidate cutoff values r 3 !<•••< Tj S ., 
which divides the real line into Sk + 1 intervals: 


( (X), Tji], (TjI , Tj'2], •••, (^'(aj-l)) 

Then we code the observed values xij ,..., x n] into indices bij,... ,b n j according to which 
interval they fall. 
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Function FindList (j, {(ci, ai),..., (cj-i, %-i)} , a'-_ ly ) 

(cj, aj, dj) arg max^ ^ a /) eCx _ 4 X _4 7? ®i)> • •• j (^j— i>®j—1)> ipji ®j}]) 

A,= 

[{ (/ 1 ? ®l)) • • • ? (Cj —1 j Oj —l) j (Cj, Oj) i dj j’ j i? [{ (Ci, Qi), . . • , (Cj_i, Oj_i), }] > 

if Aj < ^i_ Q {Var(A : ,) } 1//2 then 

yi { (Ci, di ), ... , (Cj— 1; 1) j Q'j— 1} > 

else if j = L max then 

/I { (Ci, Ox), . . . , (Cj_i, i), (Cj, Oj) , Cbj J-, 

else 

7?! = FindList(j + 1, {(ci,ai),..., (cj_i, aj_i), (cj,a,-)} , a'); 
tt 2 = FindList(j + 1, {(ci, a x ),..., (Cj_i, aj_i), (c^,a'-)J , a,-), 
where c* = negation of c,-; 

7T = argmax 7re{ ~ l ~ 2} ^(7r);: 

end 

return 7r; 
end 
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In order to reduce the number of evaluations of /?,(•) when searching for the maximizer 
over C x A x A, we organize the intermediate results as shown below. Let X = {i : Xi e 
T(q) c for all £ < j}. Then X contains all the subjects that have not had treatment recom¬ 
mendations up to the jth clause. Since we have 

m m 

nR{ 7 r) = ^2^2^ ia I{n(Xi) = a} + ^ ^ = a} 

i£X a= 1 zGX c a= 1 


and XhiGi* — a } is constant during the maximization, we focus on maxi¬ 
mizing XhiGX YXi=\ — a}, which reduces to maximizing 


X e T(cj),a= Qj] + f Tie,),a - a'}. 


( 10 ) 


iSX a=l 


i£l a =1 


To identify the maximizer of (10), we first loop over all possible pairs of covariates. For each 
pair of covariates, say the kth and the £th covariates, define D , a three-dimensional array 
of size m x (, s k + 1) x (s e + 1), as D auv = = u,b u = v). Next, we loop over 

all possible cutoff values and construct the corresponding Cj. The values of dj and a'- that 


maximizes (10) for a given c 3 can be easily obtained due to the additive structure. After 
enumerating all the possible conditions that c 3 may take, we can find out (c 3 , 7x 3 , o'-). 


A.4.2 Time Complexity Analysis 

Since computing £ ja s requires 0(nm ) time and computing bijS requires 0(np ) time. The 
calculations at the beginning of the algorithm take 0(nm + np) time in total. 

The algorithm first computes clq, which requires 0(nm) time. Then it invokes a function 
call FindList(l, {}, ao). Due to the recursive nature of the FindList function, we will 
compute the time complexity by establishing a recurrence relation between T(j ) and T(j+ 1), 
where T(j ) is the time complexity of the function call FindList (j, {(ci,a!),..., (c 3 ,a 3 )}). 

Suppose a call FindList (j, {(ci,ai),..., (cj,cij)}) is invoked. The running time can be 
computed by going through the algorithm of the FindList function step-by-step as follows. 

First, the function computes (cj, a 3 , a' ). A naive implementation would involve looping 
over all the covariates, all the possible cutoff values and all the treatment options, whose 
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running time is 0(nmp 2 s 2 ), where s = max,,- Sj. However, the running time is greatly reduced 
if we use the efficient implementation described previously. For a given pair of covariates, we 


can compute D auv s in 0(nm ) time. Then we can fold out the maximum of (10) in 0(ms 2 ) 
time by looping over all possible cutoff values. Therefore, the total time for computing 
( Cj,aj,a'j ) is 0{(n + s 2 )mp 2 }. 

Second, the function computes A j, which takes 0(n ) time. 

Third, the function computes Var(Aj), whose running time is 0(nmq + nmr), where q 
is the dimension of Ui and r is the dimension of Z { . Since both U t and Zi are known feature 
vectors constructed from X l} for most cases q and r are of the same order as p. So this step 
takes 0(nmp) time. 

Fourth, the function executes the “if-then” statement. In the worst case, the function 
makes two recursive calls, taking 2 T(j + 1) time. 

Combining these four steps, we have T(j ) = 0{{n + s 2 )mp 2 } + 2 T(j + 1). The bound¬ 
ary condition is T(L max ) = 0{(n + s 2 )mp 2 }. Using backward induction, we get T(0) = 
0{2 Lmax (n + s 2 )mp 2 }. Recall that s = inaxj #Xj. 

Combining T(0) with the running time before invoking FindList(l, {}, oo), we obtain 
that the time complexity of the entire algorithm is 0[2 Lmax mp 2 {n + (rnaxj #Xj) 2 }}. 


A.5 Implementation Details of Finding an Equivalent 

Decision List with Minimal Cost 

In this section we give an algorithmic description of the proposed method for finding an 
equivalent decision list with minimal cost. Recall that two decision lists are called equivalent 
if they give the same treatment recommendation for every patient in the population. 

The function FindMinCost is defined below. 
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Input: a decision list 7f 

Output: an equivalent decision list 7r m i n with minimal cost N m ; n 
Identify atoms in 7f as di,..., dx] 

Compute X a = {i : 7t(X,) = a} for each a G A] 

Set 7r min = {} and N min = oo; 

FindMinCost (0, {}, vr min , N min ); 


A.6 Point Estimate and Prediction Interval for R(tt) 
with Bootstrap Bias Correction 

In this section, we show how to estimate the value of the estimated treatment regime, 
and how to construct a prediction interval for it. 


A.6.1 Methodology 


To measure how well the estimated treatment regime n performs, it is often of interest to 
construct an estimator of and a prediction interval for R(tt). Since a natural candidate for 
estimating R(n) is R(n), it may be tempting to construct a prediction interval centering at 
R(tt). However, R(n) is generally too optimistic to serve as an honest estimator of R(n). 
It has an upward bias due to the maximization process. As a remedy, we suggest using 
B bootstraps to correct this bias. Specifically, the perturbed version of R(n) in the 6th 
bootstrapping sample is 


R* b (n) 



I(Aj = a) 
cu(Xj,a,7*) 




+ ^(Xi, a, f 3 *) 


/MX,) 



5 
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Function FindMinCost (j, {(ci, op), ..., ( Cj , %)}, n min , N min ) 

Compute a lower bound of the cost as N hd = J\fe Y^i=i ^n(X G Hf) 

+ AfjF n (X G fl^ =1 7 Zf), where P„ denotes the empirical probability measure; 
if A^,d > ./V min then return; 

X={i: Xi G T(q) c for all i < j}; 
if Z C Z O0 for some a 0 then 

if iV[{(ci, ai),..., (q, a,), a 0 }] < N mm then 

^min {(Cl, Ol), • • • , (Cj, Oj), do}j 

-^min -^(^"min); 

end 

else 

for 1 < k\ < k,2 < K do 

Let Cfc, be the set consisting of all the logical clauses involving 

dfcj or d fc2 or both using conjunction, disjunction, and/or negation; 
for Cj- G ^ki,k2 

J/j+i = {i G Z : Xi G T(cj + i)}; 

if is non-empty and Jj+i C Z aj+1 for some a 3+ \ G * 4 . then 

FindMinCost (j + 1 , {(ci, ai),..., (9, %), (c i+ i, a i+ i)} j TTmin ? A^ m j n ) , 

end 

end 

end 

end 
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where W ±,..., W n are identically and independently distributed with standard exponential 
distribution, 7 * is the solution to 


n ( m—1 

J2 w ‘\E IiA ‘ 


1=1 \ a= 1 

and j3* is the solution to 


aW a Ui 


l + Er/exp^T) 


0 , 




i =1 


Y i -b'\Y t I(A i = a)Z?* a p 


. a= 1 


m 


. a= 1 


I{Ai = a)^lZi \ = 0 . 


Let be the maximizer of R* b ( 7 r) over II. Then the actual bias R(jr)—R(jt) can be estimated 

by the corresponding bias in the bootstrap world: bias = Ylb=i{Rb(^b) ~ -^(^&)}/-^> where 
B is the number of bootstrap samples. The final estimator of R(n) is R c (n) = R(i f) — bias. 
To construct a prediction interval for R(tt), we treat if as a non-random quantity, and 


then utilize the asymptotic normality of R(tt) given in (| 8 j). Let z p be the lOOp percentile 
of a standard normal distribution and a 2 = Var{i?( 7 r)}. Then a (1 — a) x 100% prediction 
interval for R(i f) is 


[-Rc(tt) + z a/2 a, R c (tt) + ^i_ a / 2 a] . 


( 11 ) 


A potential drawback of this interval is, though, that it ignores the variation introduced 
by tt. Nevertheless, our numerical experiences suggest that this extra variation is generally 
small and the coverage probability is close to the nominal level. Taking into account the 
variability of tt has to deal with the associated non-regularity issue, which is beyond the 
scope of this paper. 

As a final remark, for binary outcome we suggest to conduct the bias correction and 
construct the prediction interval based on logit{i?(-)} first and then transform back to the 
original scale, where logit(u) = log{w/(l — v)}. 


A.6.2 Simulations 

We present the point estimate and the coverage probabilities of the plain prediction interval 
and the prediction interval with bootstrap bias correction in Table [3} The setting used here is 
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Table 3: Point estimate and coverage probabilities of prediction intervals with and without 
bootstrap bias correction. Plain-PI refers to the coverage probability of the plain prediction 
interval, and Corrected-Pl refers to the coverage probability of the bias-corrected prediction 


interval. 


p 

Setting 


Continuous response 


Binary response 


R c (n) 

Plain-PI 

Corrected-PI IUtt) 

R c (n) 

Plain-PI 

Corrected-PI 


1 

2.78 

2.78 

0.95 

0.94 

0.77 

0.76 

0.97 

0.96 


11 

2.70 

2.73 

0.93 

0.95 

0.71 

0.72 

0.89 

0.97 


III 

2.59 

2.61 

0.95 

0.95 

0.73 

0.74 

0.88 

0.96 

10 

IV 

2.89 

2.98 

0.88 

0.94 

0.71 

0.72 

0.63 

0.98 


V 

2.90 

2.90 

0.95 

0.95 

0.75 

0.75 

0.76 

0.96 


VI 

3.98 

4.01 

0.93 

0.95 

0.79 

0.79 

0.97 

0.99 


VII 

3.22 

3.27 

0.86 

0.94 

0.77 

0.77 

0.77 

1.00 


I 

2.76 

2.75 

0.94 

0.94 

0.76 

0.76 

0.82 

0.98 


II 

2.70 

2.72 

0.93 

0.94 

0.71 

0.71 

0.80 

0.96 


III 

2.59 

2.59 

0.94 

0.95 

0.73 

0.73 

0.63 

0.98 

50 

IV 

2.89 

2.96 

0.88 

0.94 

0.71 

0.72 

0.48 

0.98 


V 

2.87 

2.87 

0.93 

0.94 

0.74 

0.74 

0.33 

0.96 


VI 

3.95 

3.99 

0.91 

0.94 

0.78 

0.79 

0.89 

0.99 


VII 

3.21 

3.27 

0.88 

0.94 

0.76 

0.77 

0.63 

0.99 


exactly the same as that in Section 3 in the main paper. We can see that the bias correction 
improves the coverage probability substantially in finite samples, especially as the number 
of covariates gets larger. Besides, the bootstrap prediction interval is prone to overcoverage 
for the binary response. 


A. 7 Accuracy of Variable Selection 

Consider the simulated experiments in the main paper. To quantify variable selection ac¬ 
curacy, we compute the true positive rate, the number of signal variables included in the 
decision list divided by the number of signal variables, and the false positive rate, the number 
of noise variables included in the decision list divided by the number of noise variables. A 
variable is called a signal variable if it appears in (j)(x, a) and is a noise variable otherwise, 
irrespective of the actual functional form. 

Table [4]presents the true positive rates and the false positive rates under different settings. 
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Table 4: Accuracy of variable selection using decision list. TPR is the true positive rate and 


FPR is the false positive rate. 


p 

Setting 

Continuous response 

Binary response 

TPR 

FPR 

TPR 

FPR 


I 

1.00 

0.00 

1.00 

0.07 


II 

1.00 

0.00 

1.00 

0.04 


III 

1.00 

0.00 

1.00 

0.11 

10 

IV 

0.93 

0.00 

0.79 

0.07 


V 

1.00 

0.07 

1.00 

0.20 


VI 

1.00 

0.05 

1.00 

0.10 


VII 

0.94 

0.00 

0.98 

0.04 


I 

1.00 

0.01 

1.00 

0.04 


II 

1.00 

0.00 

1.00 

0.02 


III 

1.00 

0.00 

0.99 

0.04 

50 

IV 

0.93 

0.00 

0.73 

0.02 


V 

1.00 

0.02 

0.99 

0.06 


VI 

1.00 

0.02 

1.00 

0.03 


VII 

0.94 

0.00 

0.97 

0.02 


The proposed method consistently identifies signal variables and screens out noise variables 
in most settings. The only exception is setting IV, where the optimal regime is far away 
from being well approximated by decision lists. Thus the proposed approach loses power in 
detecting useful covariates due to misspecifying the form of the regime. 


A.8 Impact of the Tuning Parameter in the Stopping 
Criterion 

In the algorithm discussed in Section 2.4.1 in the main paper, we use a tuning parameter a 
to control the building process of the decision list and we suggest to fix a at 0.95. In the 
following we show that the final decision list is insensitive to the choice of a via simulation 
study. The setting used here is exactly the same as that in Section 3 in the main paper. We 
varied a among {0.9, 0.95, 0.99}. 

Table [5] shows the impact of a on the value and the cost of the estimated regime. We can 
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see that the value and the cost as well as the accuracy of variable selection, averaged over 
1000 replications, are very stable across different choices of a. Table [ 6 ] shows the impact of a 
on the estimated regime. It is clear that a has little impact on the treatment recommendation 
made by the estimated regime. 


A.9 Chronic Depression Data 


In the application considered in Section 4.2 in the main paper, we applied the proposed 
method to construct an interpretable and parsimonious treatment regime. We follow ? and 


Zhao et al. (2012), and use the following 50 covariates: 


1. Gender: 1 if female, 0 if male; 

2. Race: 1 if white, 0 otherwise; 

3. Marital status I: 1 if single, 0 otherwise; 

4. Marital status II: 1 if married or living with someone, 0 otherwise; 

5. Body mass index: continuous; 

6. Age at screening: continuous; 

7. Having difficulty in planning family activity: 1 if strongly agree, 2 if agree, 3 if disagree, 
4 if strongly disagree; 

8 . Supporting each other in the family: 1 if strongly agree, 2 if agree, 3 if disagree, 4 if 
strongly disagree; 

9. Having problems with primary support group: 1 if yes, 0 if no; 

10. Having problems related to the social environment: 1 if yes, 0 if no; 

11. Having occupational problems: 1 if yes, 0 if no; 
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Table 5: The impact of a on the value and the cost of the estimated regime. In the header, 
a is the tuning parameter in the stopping criterion; R(n) is the mean outcome under the 
estimated regime tt, computed on a test set of 10 6 subjects; N(jc) is the cost of implementing 
the estimated regime n, computed on the same test set; TPR is the true positive rate, namely, 
the number of signal variables involved in n divided by the number of signal variables; FPR 
is the false positive rate, namely, the number of noise variables involved in n divided by the 


number of noise variables. Recall that p is the dimension of patient covariates. 


p Setting 


a = 

0.9 



a = 

0.95 



a = 

0.99 















R(n) 

N(7r) 

TPR 

FPR 

R(9) 

N(n) 

TPR 

FPR 

R{n) 

Nil r) 

TPR 

FPR 

Continuous 

response 











I 

2.78 

1.65 

1.00 

0.01 

2.78 

1.65 

1.00 

0.00 

2.78 

1.65 

1.00 

0.00 

II 

2.71 

1.66 

1.00 

0.00 

2.70 

1.66 

1.00 

0.00 

2.69 

1.66 

1.00 

0.00 

III 

2.59 

1.69 

1.00 

0.00 

2.59 

1.69 

1.00 

0.00 

2.59 

1.69 

1.00 

0.00 

10 IV 

2.89 

2.51 

0.93 

0.00 

2.89 

2.51 

0.93 

0.00 

2.89 

2.51 

0.92 

0.00 

V 

2.90 

1.91 

1.00 

0.07 

2.90 

1.91 

1.00 

0.07 

2.90 

1.91 

1.00 

0.07 

VI 

3.98 

1.61 

1.00 

0.06 

3.98 

1.61 

1.00 

0.05 

3.98 

1.61 

1.00 

0.05 

VII 

3.22 

2.56 

0.94 

0.00 

3.22 

2.56 

0.94 

0.00 

3.21 

2.56 

0.94 

0.00 

I 

2.75 

1.96 

1.00 

0.01 

2.76 

1.96 

1.00 

0.01 

2.78 

1.96 

1.00 

0.00 

II 

2.70 

1.66 

1.00 

0.00 

2.70 

1.66 

1.00 

0.00 

2.69 

1.66 

1.00 

0.00 

III 

2.58 

1.75 

1.00 

0.00 

2.59 

1.75 

1.00 

0.00 

2.59 

1.75 

1.00 

0.00 

50 IV 

2.89 

2.54 

0.93 

0.00 

2.89 

2.54 

0.93 

0.00 

2.89 

2.54 

0.92 

0.00 

V 

2.87 

2.19 

1.00 

0.03 

2.87 

2.19 

1.00 

0.02 

2.88 

2.19 

1.00 

0.02 

VI 

3.95 

1.70 

1.00 

0.02 

3.95 

1.70 

1.00 

0.02 

3.95 

1.70 

1.00 

0.02 

VII 

3.22 

2.56 

0.94 

0.00 

3.21 

2.56 

0.94 

0.00 

3.21 

2.56 

0.93 

0.00 

Binary response 

I 

0.76 

2.16 

1.00 

0.12 

0.77 

2.16 

1.00 

0.07 

0.77 

2.16 

1.00 

0.02 

II 

0.71 

1.75 

1.00 

0.06 

0.71 

1.75 

1.00 

0.04 

0.71 

1.75 

1.00 

0.02 

III 

0.73 

2.24 

1.00 

0.15 

0.73 

2.24 

1.00 

0.11 

0.74 

2.24 

0.99 

0.04 

10 IV 

0.71 

2.48 

0.81 

0.08 

0.71 

2.48 

0.79 

0.07 

0.71 

2.48 

0.70 

0.06 

V 

0.75 

2.64 

1.00 

0.24 

0.75 

2.64 

1.00 

0.20 

0.75 

2.64 

1.00 

0.16 

VI 

0.79 

2.11 

1.00 

0.11 

0.79 

2.11 

1.00 

0.10 

0.79 

2.11 

1.00 

0.10 

VII 

0.77 

2.87 

0.98 

0.06 

0.77 

2.87 

0.98 

0.04 

0.76 

2.87 

0.97 

0.02 

I 

0.75 

2.87 

1.00 

0.05 

0.76 

2.87 

1.00 

0.04 

0.76 

2.87 

1.00 

0.02 

II 

0.71 

1.93 

1.00 

0.02 

0.71 

1.93 

1.00 

0.02 

0.71 

1.93 

0.99 

0.01 

III 

0.72 

2.68 

0.99 

0.04 

0.73 

2.68 

0.99 

0.04 

0.73 

2.68 

0.99 

0.02 

50 IV 

0.71 

2.65 

0.75 

0.02 

0.71 

2.65 

0.73 

0.02 

0.71 

2.65 

0.66 

0.02 

V 

0.73 

3.32 

0.99 

0.07 

0.74 

3.32 

0.99 

0.06 

0.74 

3.32 

0.99 

0.05 

VI 

0.78 

2.47 

1.00 

0.03 

0.78 

2.47 

1.00 

0.03 

0.78 

2.47 

1.00 

0.02 

VII 

0.76 

3.04 

0.97 

0.02 

0.76 

3.04 

0.97 

0.02 

0.76 

3.04 

0.95 

0.01 
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Table 6: The impact of a on the estimated regime. In the header, a is the tuning parameter in 
the stopping criterion and n a is the regime such obtained. For each pair of regimes 7? a and 7T q /, 
we report the probability that they recommend the same treatment for a randomly selected 
patient in the population. Mathematically, this is to compute P{7r a (A") = 7r a /(X)\Tr a ,7r a '} 
and then average over 1000 replications, where X is generated in the same way as in Section 3 


in the main paper. 


V 

Setting 

7T0.9 VS. 7T0.95 

7T0.95 VS. 7T0.99 

7To.9 VS. 7T0.99 

Continuous response 


I 

0.998 

0.998 

0.996 


II 

0.986 

0.975 

0.961 


III 

0.993 

0.992 

0.986 

10 

IV 

0.997 

0.997 

0.993 


V 

0.999 

1.000 

0.998 


VI 

0.998 

0.997 

0.995 


VII 

0.991 

0.988 

0.979 


I 

0.984 

0.985 

0.970 


II 

0.986 

0.977 

0.962 


III 

0.988 

0.989 

0.976 

50 

IV 

0.998 

0.996 

0.994 


V 

0.993 

0.995 

0.988 


VI 

0.997 

0.997 

0.993 


VII 

0.991 

0.989 

0.980 

Binary response 


I 

0.971 

0.969 

0.941 


II 

0.978 

0.964 

0.944 


III 

0.971 

0.952 

0.926 

10 

IV 

0.983 

0.955 

0.941 


V 

0.973 

0.969 

0.944 


VI 

0.992 

0.993 

0.985 


VII 

0.976 

0.968 

0.944 


I 

0.973 

0.946 

0.920 


II 

0.980 

0.958 

0.942 


III 

0.971 

0.947 

0.925 

50 

IV 

0.985 

0.962 

0.947 


V 

0.965 

0.939 

0.913 


VI 

0.985 

0.985 

0.969 


VII 

0.974 

0.955 

0.930 
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12. Having economic problems: 1 if yes, 0 if no; 

13. Receiving psychotherapy for current depression: 1 if yes, 0 if no or don’t know; 

14. Receiving medication for current depression: 1 if yes, 0 if no or don’t know; 

15. Having received psychotherapy for past depressions: 1 if yes, 0 if no or don’t know; 

16. Having received medication for past depressions: 1 if yes, 0 if no or don’t know; 

17. Number of major depressive disorder (MDD) episodes: 1 if one, 2 if two, 3 if at least 
three; 

18. Length of current MDD episode (in years): continuous; 

19. Age at MDD onset: continuous; 

20. MDD severity: 1 if mild, 2 if moderate, 3 if severe; 

21. MDD type I: 1 if melancholic, 0 otherwise; 

22. MDD type II: 1 if atypical, 0 otherwise; 

23. Number of dysthymia episodes: 0 if zero, 1 if one, 2 if at least two; 

24. Generalized anxiety: 0 if absent or inadequate information, 1 if subthreshold, 2 if thresh¬ 
old; 

25. Anxiety disorder (not otherwise specified): 0 if absent or inadequate information, 1 if 
subthreshold, 2 if threshold; 

26. Panic disorder: 0 if absent or inadequate information, 1 if subthreshold, 2 if threshold; 

27. Social phobia: 0 if absent or inadequate information, 1 if subthreshold, 2 if threshold; 

28. Specific phobia: 0 if absent or inadequate information, 1 if subthreshold, 2 if threshold; 
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29. Obsessive compulsive: 0 if absent or inadequate information, 1 if subthreshold, 2 if 
threshold; 

30. Body dysmorphic disorder: 0 if absent or inadequate information, 1 if subthreshold, 2 if 
threshold; 

31. Post-traumatic stress disorder: 0 if absent or inadequate information, 1 if subthreshold, 
2 if threshold; 

32. Anorexia nervosa: 0 if absent or inadequate information, 1 if subthreshold, 2 if threshold; 

33. Alcohol abuse: 0 if absent, 1 if abuse, 2 if dependent; 

34. Drug abuse (including cannabis, stimulant, opioid, cocaine, hallucinogen): 0 if absent, 1 
if abuse, 2 if dependent; 

35. Global assessment of functioning: continuous; 

36. Chronic depression diagnosis I: 1 if no antecedent dysthymia and continuous full-syndrome 
type; 

37. Chronic depression diagnosis II: 1 if no antecedent dysthymia and incomplete recovery 
type; 

38. Chronic depression diagnosis III: 1 if superimposed on antecedent dysthymia; 

39. Chronic depression severity: integer between 1 (normal) and 7 (extremely ill); 

40. Hamilton anxiety rating scale (HAM-A) total score: continuous; 

41. HAM-A psychic anxiety score: continuous; 

42. HAM-A somatic anxiety score: continuous; 

43. Hamilton depression rating scale (HAM-D) total score: continuous; 


51 



44. HAM-D anxiety/somatic score: continuous; 

45. HAM-D cognitive disturbance score: continuous; 

46. HAM-D retardation score: continuous; 

47. HAM-D sleep disturbance: continuous; 

48. Inventory of Depressive Symptoms - Self Report (IDS-SR) anxiety/arousal score: con¬ 
tinuous; 

49. IDS-SR general/mood cognition score: continuous; 

50. IDS-SR sleep score: continuous. 

A. 10 Consistency of the decision list 

Since the consistency of the decision list is difficult to analyze theoretically, we present some 
empirical evidence that the decision list tends to be consistent. We follow the simulated 
experiments considered in Section 4 in the main paper but increase the sample size. We 
consider settings I and V only as the optimal regime in other settings cannot be representable 
as a decision list. 

The sample sizes considered and the associated results are presented in Table [7J For 
continuous response, the proposed method correctly identifies the form and the important 
covariates for treatment decision. As n increases, the loss in value decreases and the probabil¬ 
ity of recommending the best treatment increases. Also, the mean squared error of estimating 
the cutoff values decreases at the rate of n _1 . Results for binary response is qualitatively 
similar. Nevertheless, we may need a even larger sample size for the asymptotics to work. 
Therefore, the simulation results provides evidence that the proposed method is consistent. 
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Table 7: Consistency of the decision list. In the header, n is the sample size; p is the number 
of predictors. Loss is R( 7r opt ) — R(jt), namely, the difference between the the value under the 
estimated regime and the value under the optimal regime. Pr(best) is P{7r(X) = 7r opt (A")|7r}, 
namely, the probability that the treatment recommended by the estimate regime coincides 
with the treatment recommended by the optimal regime. Loss and Pr(best) are averaged 
over 1000 replications. Correct is the proportion of n having the same form and covariates 
as 7T opt among 1000 replications; MSEx is the mean squared error of the estimated cutoff for 


MSE 2 is the mean squared error of the estimated cutoff for X 2 . 


Setting 

n 

P 

Loss 

Pr(best) 

Correct 

MSEi(xn) 

MSE 2 (xn) 

Continuous 

response 






I 

10 4 

10 

0.0023 

0.9982 

1.00 

4.24 

6.87 

I 

10 5 

10 

0.0006 

0.9995 

1.00 

4.30 

6.50 

I 

10 6 

10 

0.0002 

0.9998 

1.00 

4.06 

6.32 

I 

10 4 

50 

0.0022 

0.9982 

1.00 

4.60 

6.91 

I 

10 5 

50 

0.0006 

0.9995 

1.00 

4.24 

6.49 

I 

10 6 

50 

0.0002 

0.9998 

1.00 

4.22 

6.48 

V 

10 4 

10 

0.0039 

0.9975 

1.00 

6.08 

5.27 

V 

10 5 

10 

0.0010 

0.9994 

1.00 

5.86 

4.70 

V 

10 6 

10 

0.0003 

0.9998 

1.00 

5.46 

4.54 

V 

10 4 

50 

0.0036 

0.9977 

1.00 

6.10 

5.33 

V 

10 5 

50 

0.0010 

0.9994 

1.00 

5.96 

4.54 

V 

10 6 

50 

0.0003 

0.9998 

1.00 

5.69 

4.51 

Binary response 

I 

10 4 

10 

0.0007 

0.9966 

1.00 

8.13 

10.93 

I 

10 5 

10 

0.0001 

0.9994 

1.00 

5.71 

6.05 

I 

10 6 

10 

0.0000 

0.9999 

1.00 

5.27 

5.61 

I 

10 4 

50 

0.0007 

0.9965 

1.00 

9.72 

11.15 

I 

10 5 

50 

0.0001 

0.9994 

1.00 

5.71 

6.29 

I 

10 6 

50 

0.0000 

0.9998 

1.00 

5.41 

5.78 

V 

10 4 

10 

0.0094 

0.9447 

0.79 

7.81 

22.66 

V 

10 5 

10 

0.0081 

0.9547 

0.96 

4.14 

6.33 

V 

10 6 

10 

0.0079 

0.9563 

0.97 

3.89 

5.03 

V 

10 4 

50 

0.0099 

0.9418 

0.70 

6.63 

14.36 

V 

10 5 

50 

0.0078 

0.9567 

0.96 

3.97 

6.14 

V 

10 6 

50 

0.0081 

0.9550 

0.97 

3.96 

5.10 
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